Abstract

Understanding the consequence of any individual scientist's activity
within the long-term trajectory of science is one of the most difficult
problems in the philosophy of science. Because scientific publications
play such a central role in the modern enterprise of science,
bibliometric techniques that measure the "impact" of an individual
publication as a function of the number of citations it receives from
subsequent authors have provided some of the most useful empirical data
on this question. Until recently, Thomson/ISI has provided the only
source of large-scale "inverted" bibliographic data of the sort
required for impact analysis. At the end of 2004, Google introduced a
new service, GoogleScholar, making much of this same data available.
Here we analyze 203 publications, collectively cited by more than 4000
other publications. We show surprisingly good agreement between the
citation counts provided by the two services. Data quality across the
systems is analyzed, and potentially useful complementarities between
them are considered. The additional robustness offered by multiple
sources of such data promises to increase the utility of these
measurements as open citation protocols and open access increase their
impact on electronic scientific publication practices.
1 Background



Bibliometric analysis of scientific publications goes back to at least
the 1970s [?]; similar analysis of judicial opinions has been done by
Shepard's/LexisNexis for more than a hundred years. The Institute for
Scientific Information has made an industry of providing citation data
to libraries since the mid-1960s; the products are currently available
as part of Thomson/ISI (ISI). ISI reports that it currently indexes
16,000 journals, books and proceedings [?]. While far from exhaustive
(ISI estimates that of the 2000 new journals reviewed annually, only
10% are selected), the service cites "Bradford's Law" that a relatively
small number of sources capture the bulk of significant scientific
results. All articles appearing in selected publications have their
bibliographies manually transcribed, and "inverted bibliographies"
pointing from an (earlier) cited work to all (subsequent) citing
publications are generated to support users' searches. Critically, the
translation of these bibliographies into distinct records involves a
great deal of manual effort.

May has reported extensive analyses of British scientific activity in
comparison with other countries, primarily based on ISI's data [?]:
"The database has many shortcomings and biases, but overall it gives a
wide coverage of most fields." [10, p. 793] His critique of
shortcomings in this data is useful:

Some problems have to do with the compilation of the database. It
includes citations of books and chapters in edited books, but it does
not include the citations in such publications. Other publications,
such as government and other agency reports and working papers, are
essentially omitted. It does not cover all significant scientific
journals.... Papers that describe technical methods may attract
thousands of reflexive citations, while pathbreaking papers may be
cited only slightly for many years. Review articles can mask the
primary papers they review. Citation patterns vary among fields....
Spectacular scientific errors may attract many citations....
Self-citation (which accounts for at least 10% of all citations) may
bias some of the results. [?, Footnote 3]

Some of these issues (e.g., having to do with the sources being
compiled) can be expected to be altered by new forms of electronic
scientific publication, but others (e.g., self-citation) are likely to
be more intrinsic to scientific authoring processes. It is for this
reason that Google's recent announcement of their Scholar.Google (beta)
service (GoogleScholar) is welcome, as a second, independent source of
similar data.

While specifics concerning Google's operation are difficult to come by,
it is reasonable to assume that the process relies on more automatic,
algorithmic procedures than those used by ISI. Linkage structure among
Web pages is analogous in important ways to scientific publication [?].
These links are captured by Web crawling algorithms as both "citing"
pages (i.e., Web pages with HTML anchors pointing to other Web pages)
and "cited" pages are visited, a feature exploited by Google's original
"PageRank" retrieval algorithm [?]. GoogleScholar attempts to bring
similar analyses to academic publication, despite the fact that these
source documents are often much less accessible.

2 Methods 

Given an author's name^1, both ISI and GoogleScholar provide search
facilities that return a list of publications putatively authored by
this individual, together with the number of times each of these
publications has been cited by other publications discovered by the
service. Six academics were selected at random and used as "probe"
queries with both systems.^2 Complete bibliographies of all
publications by these authors were manually reconciled against 203
references to these publications returned by one or both systems, and
then analyzed in detail. Cumulatively, ISI discovered 4741 such
references; GoogleScholar found 4045.

Because the standards and formats of bibliographic citations vary
widely across different publications, the process of reconciling
citation strings from different papers to the same target publication
is problematic, whether via ISI's manual process or Google's automatic
one. It is common, therefore, to find that the same publication has
been treated as more than one record.^3

^1 Translation of an author's name into search query string(s) can be
ambiguous. In these experiments both the first initial alone and the
first initial together with the middle initial, each combined with the
full last name, were used as the author's name.

^2 These academics were all drawn from a single, particularly
interdisciplinary academic department.

^3 The alternative type of error, where citations to multiple, distinct
publications are confounded as part of the citation record of a single
entry, is more difficult to identify.

For example, manual inspection reveals that a single publication in the
"Proceedings of the 12th Annual Conference of ACM's Special Interest
Group in Information Retrieval (SIGIR)" is listed as twelve separate
records by ISI; these are shown in Table 1. While most citations to
this target publication have been conveniently collected with respect
to two of these records, such noisy data makes impact analysis
difficult. In these experiments, a publication's "impact" is defined as
the number of citations found to any of the variations resolved to the
published work, i.e., the sum is taken across all records (manually)
identified as referencing the same publication.

PubYear   CiteString                     Citations
1989      12 AlNlN llN 1 ALylVl blGlK    1
1989      12 AlNlN UN 1 U Kbb DbV        1
[?]       19TTT P ANN TNT APA/T ^ 11     [?]
1989      12TH P INT C RES DEV           1
1989      ACM SIGIR INT C RES            1
1988      JUN P ACM SIGIR 88 G 11        1
1989      P 11 INT ACM SIGIR C           1
1989      P 12 ANN INT ACM SIG           2
1989      P 12 ANN INT ACM SIG 11        16
1989      SIGIR 89 11                    2
1989      SIGIR FORUM 23 11              1
1990      SIGOIS B 11 48                 1

Table 1: Citation variations for same publication
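The impact definition above amounts to a grouped sum over variant records. The following is a minimal sketch, not the procedure actually used in the study: the publication keys ("sigir89-paper", "other-paper") are hypothetical labels standing in for the manual resolution step, the first nine rows reuse the legible cite strings from Table 1, and the last row is invented purely for illustration.

```python
from collections import defaultdict

# Each record: (resolved publication key, cite string, citations reported).
# The keys are hypothetical stand-ins for the manual resolution step.
records = [
    ("sigir89-paper", "12TH P INT C RES DEV", 1),
    ("sigir89-paper", "ACM SIGIR INT C RES", 1),
    ("sigir89-paper", "JUN P ACM SIGIR 88 G 11", 1),
    ("sigir89-paper", "P 11 INT ACM SIGIR C", 1),
    ("sigir89-paper", "P 12 ANN INT ACM SIG", 2),
    ("sigir89-paper", "P 12 ANN INT ACM SIG 11", 16),
    ("sigir89-paper", "SIGIR 89 11", 2),
    ("sigir89-paper", "SIGIR FORUM 23 11", 1),
    ("sigir89-paper", "SIGOIS B 11 48", 1),
    ("other-paper", "HYPOTHETICAL J 5 101", 3),  # invented row, not from the study
]


def impact_by_publication(records):
    """Sum citation counts across all records resolved to the same publication."""
    totals = defaultdict(int)
    for pub_key, _cite_string, citations in records:
        totals[pub_key] += citations
    return dict(totals)


impact = impact_by_publication(records)
print(impact["sigir89-paper"])  # 26, the sum over the nine legible Table 1 rows
```

The hard part in practice is producing the publication key, which in this study was done by hand; the sum itself is trivial once records are resolved.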

3 Results 

Figure 1 shows how well both systems aggregate individual citations
that in fact refer to the same published paper. It plots the cumulative
probability that one, two, or more records listed as distinct by a
system in fact refer to the same publication. For example, more than
60% of the articles are represented as unique entries within ISI's
listing, while 85% of them are unique within GoogleScholar. None of the
articles had more than five separate listings within GoogleScholar,
while 13% had five or more entries in ISI's system (e.g., the example
shown in Table 1 had 12).
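A cumulative distribution of this kind can be computed directly from the number of distinct records found for each publication. The sketch below assumes a simple list of per-publication record counts; the sample values are hypothetical and are not the study's data.

```python
from collections import Counter


def cumulative_fraction(records_per_publication):
    """Map k -> fraction of publications represented by at most k records."""
    counts = Counter(records_per_publication)
    total = len(records_per_publication)
    running = 0
    cdf = {}
    for k in sorted(counts):
        running += counts[k]
        cdf[k] = running / total
    return cdf


# Hypothetical sample: ten publications, one of them split into 12 records.
sample = [1, 1, 1, 2, 1, 5, 1, 3, 12, 1]
cdf = cumulative_fraction(sample)
for k, fraction in cdf.items():
    print(k, fraction)
```

Here `cdf[1]` is the share of publications that appear as a single, unique entry, which is the quantity quoted for each service in the text.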

Figure 1: Redundant citation noise

Figure 3: Temporal distribution of citations

Overlap between the two sources of data was relatively small. Of the
203 publications analyzed, only 78 received at least one cited
reference from each system. However, for this subset the general
pattern of agreement was quite good. Figure 2 shows the number of
citations reported by GoogleScholar and ISI for the subset of 78
publications. Note that the number of citations is plotted on a log-log
scale, reflecting the well-known power law distribution of citation
counts [?]. Based on this sample, there seems good evidence
(r^2 = 0.5023, t = 8.872, p < 0.005) for a power law relation
(GS = 3.1718 * ISI^[?]) relating the number of citations reported by
the two services.
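A power law relation of the form GS = a * ISI^b is commonly fitted by ordinary least squares on the logarithms of both counts. The sketch below assumes that approach; the source does not state which fitting procedure was used, and the sample data are synthetic values constructed to lie exactly on a known curve. It requires strictly positive counts, which holds for publications cited at least once by each service.

```python
import math


def fit_power_law(x, y):
    """Fit y = a * x**b by least squares on (log x, log y).

    Returns (a, b, r_squared). All values in x and y must be positive.
    """
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    syy = sum((v - mean_y) ** 2 for v in ly)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    b = sxy / sxx                      # slope in log-log space is the exponent
    a = math.exp(mean_y - b * mean_x)  # intercept back-transformed to a scale factor
    r_squared = (sxy * sxy) / (sxx * syy)
    return a, b, r_squared


# Synthetic data lying exactly on y = 3 * x**0.5 (not the study's data).
isi_counts = [1, 2, 4, 8, 16]
gs_counts = [3 * v ** 0.5 for v in isi_counts]
a, b, r2 = fit_power_law(isi_counts, gs_counts)
print(round(a, 3), round(b, 3), round(r2, 3))
```

On real citation counts the points scatter around the fitted line, and the r^2 value measures how much of the variance in log counts the relation explains.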

Figure 3 shows the cumulative number of citations reported, by
publication year of the cited work. An alternative criterion for
considering the match between systems is to define a "miss" to be a
publication for which one service has identified three or more
citations, but which the other service does not capture whatsoever.
Figure 4 shows missing citations, found by one service but not the
other, again distributed by publication year. GoogleScholar seems
competitive in terms of coverage for materials published in the last
twenty years; before then ISI seems to dominate.
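The "miss" criterion defined above is easy to state as code. This is a sketch under assumptions: each publication is represented as a hypothetical (year, isi_count, gs_count) tuple, the function and variable names are illustrative, and the sample rows are invented.

```python
from collections import Counter


def misses_by_year(publications, threshold=3):
    """Count misses per publication year.

    A publication is missed by a service when the other service reports at
    least `threshold` citations and this service reports none at all.
    Returns (missed_by_gs, missed_by_isi) as Counters keyed by year.
    """
    missed_by_gs = Counter()
    missed_by_isi = Counter()
    for year, isi_count, gs_count in publications:
        if isi_count >= threshold and gs_count == 0:
            missed_by_gs[year] += 1
        elif gs_count >= threshold and isi_count == 0:
            missed_by_isi[year] += 1
    return missed_by_gs, missed_by_isi


# Invented sample rows: (publication year, ISI citations, GoogleScholar citations).
sample = [
    (1985, 5, 0),  # missed by GoogleScholar
    (1985, 2, 0),  # below threshold, not counted as a miss
    (1985, 7, 0),  # missed by GoogleScholar
    (1999, 0, 4),  # missed by ISI
    (2001, 3, 3),  # found by both
]
gs_misses, isi_misses = misses_by_year(sample)
print(dict(gs_misses), dict(isi_misses))
```

The threshold of three keeps publications with only one or two citations in a single service from being counted, since those are more likely to reflect noise than a real coverage gap.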

Figure 4: Temporal distribution of missing citations (series: GSMiss,
ISIMiss)

Figure 5: Coverage by publication type

Figure 6: Coverage for individual authors

Coverage with respect to the two systems can also be analyzed along
other dimensions of the publications, including publication venue and
author. Figure 5 aggregates publications into four categories:
conference publications, books (or book chapters), journal articles,
and other forms of publication (e.g., technical reports and
dissertations); tests confirm the distributions are distinct.
Publications in books (as noted by May, above) and conference
proceedings are much more likely to be available via GoogleScholar;
conversely, journal articles are better indexed via ISI. If citations
are summarized with respect to the six authors analyzed, Figure 6 shows
that some authors are better represented with respect to one service as
opposed to another. Such variation is to be expected, given that some
authors, via the publication venues through which they typically
report, will be more or less well-covered by one service or another.
Again, tests confirm the distributions are distinct.
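The text does not name the test used to compare these distributions. One standard choice for comparing two services across categories is Pearson's chi-square test on a contingency table, sketched below as an assumption rather than as the authors' method. The counts in the example table are hypothetical.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table.

    `table` is a list of rows with equal length and positive row and column totals.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical counts. Rows: ISI, GoogleScholar.
# Columns: conference, book, journal, other.
coverage = [
    [10, 5, 60, 5],
    [40, 25, 30, 15],
]
statistic, dof = chi_square(coverage)
# With 3 degrees of freedom the 5% critical value is 7.815.
print(round(statistic, 2), dof, statistic > 7.815)
```

A statistic above the critical value indicates that the two services distribute their coverage differently across publication types; the same function applies to a services-by-authors table.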
4 Summary



Evaluating academics' performance, as individuals or as part of larger
social groups, in terms of the number of publications they produce is
common practice. The ability to quantify their "impact" in terms of the
number of other publications that subsequently choose to cite their
work arguably provides a more refined and relevant measure. Such data
is subject, however, to confounding factors ranging from noise in the
process of collating and "inverting" bibliographic references through
intrinsic features of scientific publication (e.g., self-citation). The
results presented above are therefore reassuring in that new evidence
from GoogleScholar provides the first independent confirmation of
impact data previously available only from ISI. However, analysis
across both systems also shows significant variations with respect to
the two dimensions (authorship and publication type) considered; other
dimensions of variation are certain to exist. This analysis also
revealed some problems common to both systems. For example, both
services support only simple ASCII encodings of author names. These are
likely to lose important character markup (available via Unicode
representations), which can be especially problematic for authors whose
names include non-ASCII characters.
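The loss described here can be illustrated with one common ASCII-folding approach: Unicode decomposition followed by dropping everything outside ASCII. How either service actually transliterates names is not documented in the source, so this is only an assumed stand-in that shows the kind of information lost.

```python
import unicodedata


def ascii_fold(name):
    """Reduce a name to ASCII by decomposing characters and dropping the rest."""
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")


for name in ["Müller", "Muller", "Erdős", "Bjørn"]:
    print(name, "->", ascii_fold(name))
```

Two distinct names ("Müller" and "Muller") collapse to the same string, and a letter with no decomposition ("ø") disappears entirely, so citation records for different authors can be merged or a single author's records can be split.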

Critically, new services within selected disciplines [?], changing
standards regarding exchange of "open citation" information [?], in
combination with increased pressure for public access to scientific
publications, will soon make some operational difficulties associated
with impact analysis obsolete. In the interim, academic deans, science
policy advisors and anyone else relying on citation count data are
cautioned that any individual measurement requires more context. In the
longer term, the increased availability of statistics like
bibliographic impact makes it increasingly important to understand how
publication and citation activities, within both scientific publication
and Web publishing more generally, can be included as part of more
holistic evaluations of intellectual contribution [?].
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